What Are Evaluation Metrics and When to Use Which Metrics?


Evaluation metrics are used to evaluate machine learning models. We should know when to use which metric, and that depends mainly on what kind of targets (labels) we have.

Classification Problem :

Accuracy
Precision
Recall
F1 score
Area under the ROC curve (AUC)
Log loss
Precision at k (P@k)
Average precision at k (AP@k)
Mean average precision at k (MAP@k)
Gini coefficient

Regression Problem :

Mean squared error (MSE)
Mean absolute error (MAE)
Root mean squared error (RMSE)
Root mean squared logarithmic error (RMSLE)
R² / Adjusted R²
Mean percentage error (MPE)
Mean absolute percentage error (MAPE)

Let's say we have a binary classification problem with an equal number of cancer and non-cancer samples in the target column: 200 positive samples and 200 negative samples, so that the training and validation sets each have 100 positive and 100 negative samples. When we have an equal number of positive and negative samples, we generally use accuracy, precision, recall, and F1.

Accuracy: It measures how often our model's predictions are correct. For example, if we predict 80 samples correctly out of 100 and 20 incorrectly, we have an accuracy of 80% or 0.80.

from sklearn import metrics

metrics.accuracy_score(y_true, y_pred)

where y_true is the actual labels and y_pred is the predicted labels.

Precision: Let's say we have a skewed dataset, i.e., the samples in one class outnumber the samples in the other class by a huge margin. Accuracy is not a good choice here because it is not representative of the data: we can get high accuracy simply by predicting the majority class, even if the model performs poorly on the class we actually care about. So we look at precision.

Let's take the example of a cancer dataset: if the label is cancer, we call it the positive class (1), and if it is non-cancer, we call it the negative class (0).

True positive (TP): If our model predicts cancer and the actual target is also cancer, it is called a true positive.

True negative (TN): If our model predicts non-cancer and the actual target is also non-cancer, it is called a true negative.

In other words, if our model predicts the positive class correctly, it is a true positive, and if it predicts the negative class correctly, it is a true negative.

False positive (FP): If our model predicts cancer but the actual target is non-cancer, it is called a false positive.

False negative (FN): If our model predicts non-cancer but the actual target is cancer, it is called a false negative.

If our model predicts the positive class incorrectly, it is a false positive, and if it predicts the negative class incorrectly, it is a false negative.

Precision = TP/(TP + FP)

Let's say we classify 70 out of 80 non-cancer labels correctly and 16 out of 20 cancer labels correctly. Hence, we predict 86 out of 100 labels correctly, for an accuracy of 86% or 0.86. But out of 100 samples, 10 non-cancer samples are misclassified as cancer and 4 cancer samples are misclassified as non-cancer.

TP:16

TN:70

FP:10

FN:4

Precision = 16/(16 + 10) ≈ 0.6154, or 61.54%. In other words, when our model predicts the positive label (cancer), it is correct about 61.54% of the time.

Recall:

Recall = TP/(TP+FN)

Recall = 16/(16 + 4) = 0.8

Our model predicts 80% of the positive samples correctly.
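As a quick check, here is a minimal sketch that rebuilds this worked example and verifies both numbers with scikit-learn (the list construction below is just one hypothetical way to reproduce the TN/FP/FN/TP counts):

from sklearn import metrics

# Rebuild the worked example: TN = 70, FP = 10, FN = 4, TP = 16
y_true = [0] * 80 + [1] * 20                       # 80 non-cancer, 20 cancer
y_pred = [0] * 70 + [1] * 10 + [0] * 4 + [1] * 16  # the model's predictions

print(metrics.precision_score(y_true, y_pred))  # 16 / (16 + 10) ≈ 0.6154
print(metrics.recall_score(y_true, y_pred))     # 16 / (16 + 4) = 0.8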

Ideally, our model should have both high precision and high recall. We usually choose a probability threshold of 0.5, but this is not always ideal, and precision and recall change depending on the threshold we pick.

F1 score:

The F1 score is a combination of precision and recall. It takes both false positives and false negatives into account, unlike precision and recall, which each account for only one of the two. The F1 score is usually more useful than accuracy, especially if we have an uneven class distribution.

F1 score = (2*Precision*Recall)/(Precision + Recall)

or

F1 = 2TP/(2TP+FP+FN)

Python implementation:

from sklearn import metrics

metrics.f1_score(y_true, y_pred)

A perfect model has an F1 score of 1. The higher the F1 score, the better the model, which makes it a good metric for skewed labels.

TPR (True Positive Rate) is the same as recall. It is also called sensitivity.

TPR = TP/(TP + FN)

FPR (False Positive Rate) equals 1 − specificity, where specificity (the true negative rate) is TN/(TN + FP).

FPR = FP /(TN+FP)
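Scikit-learn has no dedicated TPR/FPR functions, but both fall out of the confusion matrix; a minimal sketch, assuming y_true and y_pred are binary label arrays as above:

from sklearn import metrics

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()
tpr = tp / (tp + fn)  # true positive rate (recall / sensitivity)
fpr = fp / (fp + tn)  # false positive rate (1 − specificity)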

ROC: An ROC curve (receiver operating characteristic) is a graph showing the performance of a classification model at all classification thresholds; it plots the true positive rate against the false positive rate as the threshold varies.

AUC (Area under the ROC Curve): AUC measures the entire two-dimensional area underneath the ROC curve. It is a widely used metric for skewed binary classification problems.

In binary classification, we can choose the threshold using the ROC curve, which tells us how the threshold affects the false positive rate and the true positive rate.

Prediction = 1 if Probability ≥ threshold, else 0
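A minimal sketch of both with scikit-learn, assuming y_score holds the model's predicted probabilities for class 1:

from sklearn import metrics

# One (FPR, TPR) point per candidate threshold
fpr, tpr, thresholds = metrics.roc_curve(y_true, y_score)

# Area under that curve
auc = metrics.roc_auc_score(y_true, y_score)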

Log loss: The average of the individual log losses over all samples. Log loss penalizes confident but incorrect (far-off) predictions very heavily, so we need to be careful while using it.

For a binary classification problem, the log loss of a single sample is

Log Loss = −(target × log(prediction) + (1 − target) × log(1 − prediction))

where target is either 0 or 1 and prediction is the predicted probability of the sample belonging to class 1.
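A minimal sketch of this formula with NumPy, assuming y_true is a 0/1 array and y_prob holds the predicted probabilities of class 1 (scikit-learn's metrics.log_loss does the same with more safeguards):

import numpy as np

def binary_log_loss(y_true, y_prob, eps=1e-15):
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(y_prob, eps, 1 - eps)  # clip so log(0) never occurs
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return losses.mean()  # average of the individual log losses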

In multi-label classification, each sample can have one or more classes associated with it, as in the Human Protein Multi Label Image Classification competition (https://www.kaggle.com/c/jovian-pytorch-z2g/overview/evaluation). There we use the following evaluation metrics:

Precision at k (P@k):

Precision at k is the proportion of predicted items in the top-k set that are relevant. It is the number of hits in the predicted list considering only the top-k predictions, divided by k.

Average Precision at k (AP@k):

The mean of P@i for i=1, …, K.

For example, if we want to calculate AP@3, we sum P@1, P@2, and P@3 and divide that value by 3.

Mean Average Precision at k (MAP@k): MAP@k is just the average of AP@k over all samples.

For example, if we want to calculate MAP@3, we sum AP@3 for all the users and divide that value by the number of users.

P@k, AP@k, and MAP@k all range from 0 to 1, with 1 being the best.
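These definitions translate almost directly into code; a minimal sketch with hypothetical helper functions (for one sample, y_true is the set of relevant labels and y_pred is the ranked list of predicted labels):

def precision_at_k(y_true, y_pred, k):
    # hits among the top-k predictions, divided by k
    return len(set(y_pred[:k]) & set(y_true)) / k

def average_precision_at_k(y_true, y_pred, k):
    # mean of P@1 ... P@k
    return sum(precision_at_k(y_true, y_pred, i) for i in range(1, k + 1)) / k

def mean_average_precision_at_k(y_trues, y_preds, k):
    # average of AP@k over all samples (or users)
    return sum(average_precision_at_k(t, p, k)
               for t, p in zip(y_trues, y_preds)) / len(y_trues)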

Log loss for multi-label classification:

We can convert the labels into a binary format and then compute a log loss for each column; in the end we take the average of the per-column log losses. This is called column-wise log loss. There are other approaches as well; the standard way to train a multi-label classifier is with a sigmoid output and binary cross-entropy.
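A minimal sketch of column-wise log loss, assuming y_true is an (n_samples, n_labels) binary matrix and y_prob the matching matrix of predicted probabilities:

import numpy as np
from sklearn import metrics

def columnwise_log_loss(y_true, y_prob):
    # one log loss per label column, then the average across columns
    losses = [metrics.log_loss(y_true[:, j], y_prob[:, j], labels=[0, 1])
              for j in range(y_true.shape[1])]
    return float(np.mean(losses))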

For regression, the evaluation metrics are discussed below (a combined code sketch follows the list):

1. Error: It is the simplest metric to calculate.

Error = True Value − Predicted Value

2. Absolute Error: The absolute value of the error.

Absolute Error = abs(True Value − Predicted Value)

3. Mean Absolute Error (MAE): The mean of the absolute errors over all samples.

Mean Absolute Error = Σ abs(True Value − Predicted Value) / n

where n is the number of samples

4. Squared Error: The square of the error.

Squared Error = (True Value − Predicted Value)²

5. Mean Squared Error (MSE):

MSE = Σ (True Value − Predicted Value)² / n

where n is the number of samples

6. RMSE (root mean squared error):

RMSE = sqrt(MSE)

7. Percentage Error:

Percentage Error = ((True Value − Predicted Value) / True Value) × 100

8. Mean Percentage Error (MPE):

MPE = (100 / n) × Σ ((True Value − Predicted Value) / True Value)

where n is the number of samples

9. Mean Absolute Percentage Error (MAPE):

MAPE = (100 / n) × Σ (abs(True Value − Predicted Value) / True Value)

where n is the number of samples

10. R² (R-squared): It is also called the coefficient of determination; it measures the proportion of the variance in the target that the model explains: R² = 1 − Σ (True Value − Predicted Value)² / Σ (True Value − Mean of True Values)².

If we are building a linear regression on multiple variables, it is advisable to use adjusted R-squared to measure the goodness of the model, since plain R² never decreases when more variables are added. If we only have one input variable, R² and adjusted R² are exactly the same.

11. Root Mean Squared Logarithmic Error (RMSLE):

RMSLE is usually used when we don't want to penalize huge differences between the predicted and the actual values when both are huge numbers; it is the RMSE computed on log(1 + value) rather than on the raw values.
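As promised above, here is a minimal sketch computing these regression metrics with NumPy (it assumes strictly positive targets, since MPE, MAPE, and RMSLE are problematic otherwise):

import numpy as np

def regression_metrics(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    error = y_true - y_pred
    mae = np.mean(np.abs(error))
    mse = np.mean(error ** 2)
    rmse = np.sqrt(mse)
    mpe = 100 * np.mean(error / y_true)           # assumes no zero targets
    mape = 100 * np.mean(np.abs(error) / y_true)  # assumes no zero targets
    rmsle = np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))
    r2 = 1 - np.sum(error ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MPE": mpe,
            "MAPE": mape, "RMSLE": rmsle, "R2": r2}

Scikit-learn provides ready-made equivalents for most of these, such as metrics.mean_absolute_error, metrics.mean_squared_error, and metrics.r2_score.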

MSE and RMSE are the most popular metrics for evaluating regression models. There are many other metrics, including more advanced ones, used for regression. Understanding the metric we are using helps us improve our model.



